feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
Conversation
Add rocm_path_resolver and shared therock_markers; wire Context, run CLI, validators, and orchestrator. Document in README, configuration, and usage; update CHANGELOG. Extend unit and integration tests. Adjust therock_detector (sys.path helper) without ADR docs in this commit. Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution for madengine run, including auto-detection for versioned /opt/rocm-* and TheRock layouts, plus explicit host/container override semantics via MAD_ROCM_PATH.
Changes:
- Introduces ROCm root detection/normalization helpers (
rocm_path_resolver) and shared TheRock file-marker helpers (therock_markers). - Wires host/container ROCm-path precedence through
Context,run_orchestrator, CLI validation, and help text. - Updates docs/tests (unit + integration) to reflect the new resolution behavior and TheRock markers.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit tests for TheRock marker path helpers and detection. |
| tests/unit/test_rocm_path.py | Adds/extends unit tests for host/container ROCm path precedence and auto-detect behavior. |
| tests/integration/test_gpu_management.py | Adjusts integration tests to patch the correct vendor-detection import path and stabilizes a Context-based test with mocks. |
| src/madengine/utils/therock_markers.py | Adds canonical TheRock marker relative paths and helper functions. |
| src/madengine/utils/rocm_path_resolver.py | Implements host/container ROCm root resolution, auto-detect heuristics, and legacy compatibility functions. |
| src/madengine/scripts/common/tools/therock_detector.py | Reuses shared TheRock marker helpers and improves importability when run as a standalone script. |
| src/madengine/orchestration/run_orchestrator.py | Defers ROCm path resolution to Context instead of pre-resolving in the orchestrator. |
| src/madengine/core/context.py | Uses new resolver functions to compute host ROCm path and set container ROCM_PATH consistently. |
| src/madengine/core/constants.py | Keeps get_rocm_path() as a legacy (non-auto-detect) wrapper delegating to the new resolver’s legacy function. |
| src/madengine/cli/validators.py | Adds schema validation for host/container MAD_ROCM_PATH values in additional context. |
| src/madengine/cli/commands/run.py | Updates --rocm-path help text to reflect new precedence/auto-detect behavior. |
| docs/usage.md | Documents host vs container ROCm root overrides and auto-detect disabling. |
| docs/configuration.md | Documents new precedence rules and override mechanisms for ROCm root resolution. |
| README.md | Updates top-level docs to reflect new ROCm-path behavior and override knobs. |
| CHANGELOG.md | Notes the new ROCm auto-detect + override behavior and references a design doc. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… checks - csv_parser.py: resolve rocm-smi via shutil.which() so TheRock images (where tools live in a Python venv, not /opt/rocm/bin/) are detected correctly; accept path_resolver to get ROCm version without hardcoding /opt/rocm/.info/version; add bounds check in NVIDIA GPU info parser - rocenv_tool.py: pass path_resolver to CSVParser so version resolution uses RocmPathResolver.get_version() for both TheRock and traditional installs - container_runner.py: replace host-resolved hardcoded amd-smi/rocm-smi paths in container exec with PATH-based resolution; add run phase environment table showing host vs container installation type, ROCm/CUDA root, and version side-by-side (supports apt install and therock layouts) - docs: document run phase environment table in usage.md and configuration.md Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a centralized ROCm install-root resolution flow for the run phase (host + container), including auto-detection of versioned /opt/rocm-* and TheRock marker layouts, and wires it through context creation, container execution, CLI validation, and documentation.
Changes:
- Introduces
madengine.utils.rocm_path_resolver+madengine.utils.therock_markersand updatesContextto resolve host ROCm root and propagate containerROCM_PATHwith clear override precedence. - Updates runtime execution (container runner + rocEnvTool CSV output) to better handle TheRock-style layouts where tools are discovered via
PATH. - Adds unit tests for TheRock markers and ROCm path resolution, adjusts integration tests, and updates CLI help + docs/README/CHANGELOG to document the new behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit coverage for TheRock marker-path helpers. |
| tests/unit/test_rocm_path.py | Adds unit coverage for new host/container ROCm path resolution behavior and TheRock layouts. |
| tests/integration/test_gpu_management.py | Updates integration mocking/patch points to match current vendor detection paths and stabilizes Context initialization in tests. |
| src/madengine/utils/therock_markers.py | Defines shared TheRock on-disk marker locations and detection helper. |
| src/madengine/utils/rocm_path_resolver.py | Implements host/container ROCm root resolution, auto-detect, and legacy compatibility functions. |
| src/madengine/scripts/common/tools/therock_detector.py | Reuses shared marker helpers; improves importability when run as a standalone script. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py | Passes path resolver into CSVParser so version resolution follows detected ROCm root. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py | Uses PATH discovery for rocm-smi and uses resolver-based version when available. |
| src/madengine/orchestration/run_orchestrator.py | Stops pre-resolving ROCm path in orchestrator; delegates to Context/resolver logic. |
| src/madengine/execution/container_runner.py | Uses in-container PATH for SMI tools and prints a host vs container environment summary table. |
| src/madengine/core/context.py | Centralizes ROCm path resolution and container ROCM_PATH propagation during runtime context init. |
| src/madengine/core/constants.py | Reframes get_rocm_path() as legacy behavior and delegates to get_rocm_path_legacy(). |
| src/madengine/cli/validators.py | Validates new MAD_ROCM_PATH fields in additional context (top-level + docker_env_vars). |
| src/madengine/cli/commands/run.py | Updates --rocm-path help to match new precedence/auto-detect behavior. |
| docs/usage.md | Documents host/container override behavior and the new run-phase environment table output. |
| docs/configuration.md | Documents precedence, overrides, and references ADR for ROCm path resolution. |
| README.md | Updates high-level ROCm path messaging to match new auto-detect + MAD_ROCM_PATH override model. |
| CHANGELOG.md | Records new ROCm path resolution behavior and references design documentation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…able The previous shell command embedded \$(rocm-sdk path --root) inside [ -f "..." ], which broke quoting when passed via docker exec bash -c "..." causing the test to always fall through to 'unknown'. Replace with a quoting-safe check: if rocm-sdk command exists and returns a root path, report 'therock'; otherwise check /opt/rocm/.info/version for traditional apt install layout. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Apply docker_env_vars overrides in context; finalize in run_container (AMD) from OCI image env, then docker run probe, then /opt/rocm with warning - Update docs, README, CLI reference, CHANGELOG, and unit tests - ADR 0001 left out of this commit (untracked) Made-with: Cursor
…to coketaste/v2-rocm-path
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution for the run phase, including host-side auto-detection (TheRock + versioned /opt/rocm-*) and explicit host/container override behavior, and updates runtime logging + docs/tests accordingly.
Changes:
- Introduces
madengine.utils.rocm_path_resolver+madengine.utils.therock_markersto centralize ROCm/TheRock detection and documented precedence. - Updates runtime wiring (Context + ContainerRunner) so host ROCm root is resolved early, while container
ROCM_PATHis finalized at Docker run time (OCI env → probe → fallback). - Adds/updates unit + integration tests and updates documentation/README/CHANGELOG to reflect the new precedence and behaviors.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
New host/container ROCm path resolution logic (auto-detect, overrides, OCI/probe finalization). |
src/madengine/utils/therock_markers.py |
New shared TheRock marker path helpers. |
src/madengine/core/context.py |
Uses new host resolver; stops mirroring host ROCm root into container env at init; applies container override mapping only. |
src/madengine/execution/container_runner.py |
Finalizes container ROCM_PATH at run time; prints host/container environment summary; uses PATH-based SMI tool invocation in container. |
src/madengine/orchestration/run_orchestrator.py |
Passes --rocm-path through to Context without legacy resolution. |
src/madengine/core/constants.py |
get_rocm_path() becomes legacy wrapper delegating to resolver’s legacy behavior. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH in top-level context and docker_env_vars. |
src/madengine/cli/commands/run.py |
Updates --rocm-path help text to describe new host behavior + auto-detect toggle. |
src/madengine/scripts/common/tools/therock_detector.py |
Imports TheRock marker helpers reliably when run as a file. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py |
Threads path_resolver into CSV parsing. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py |
Uses PATH-based rocm-smi detection; uses resolver for ROCm version; adds parsing guard. |
tests/unit/test_rocm_path.py |
Adds coverage for new resolver APIs and revised Context/container behavior. |
tests/unit/test_therock_markers.py |
Adds unit tests for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Fixes patch target for vendor auto-detection; stabilizes non-AMD context creation in integration tests. |
docs/usage.md |
Documents host vs container ROCm path behavior + new run-phase environment table. |
docs/configuration.md |
Documents new precedence and container finalization behavior (OCI/probe). |
docs/installation.md |
Clarifies host vs container ROCm path configuration. |
docs/cli-reference.md |
Updates CLI reference for --rocm-path and env vars. |
README.md |
Updates high-level ROCm path behavior summary. |
CHANGELOG.md |
Notes behavior change and new resolver modules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
MI350X (gfx950) was missing from the skip_gpu_arch fixture, causing test_commandline_argument_skip_gpu_arch to fail on this machine. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…TH in --additional-context
The --rocm-path flag was an alias for top-level MAD_ROCM_PATH but its
help text implied it could set both host and container paths, risking
confusion when environments differ. Users should set host and container
ROCm roots independently via --additional-context:
Host: {"MAD_ROCM_PATH": "/path/to/host/rocm"}
Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}
Also fixes datetime.utcnow() deprecation warnings in mongodb.py
(replaced with datetime.now(timezone.utc)).
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new ROCm root resolution flow that can auto-detect host ROCm installs (including versioned /opt/rocm-* and TheRock marker layouts) and cleanly separates host ROCm resolution from in-container ROCM_PATH selection at Docker run time.
Changes:
- Added
madengine.utils.rocm_path_resolver(host auto-detect + container finalization) andmadengine.utils.therock_markers(shared TheRock marker paths). - Updated runtime wiring so
Contextresolves host ROCm root via the resolver, whileContainerRunnerfinalizes containerROCM_PATHat run time (OCI env → probe → default). - Removed
madengine run --rocm-pathusage in favor ofMAD_ROCM_PATH(top-level anddocker_env_vars), and updated docs/tests accordingly.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
Implements host ROCm auto-detect and container ROCM_PATH override/finalization logic. |
src/madengine/utils/therock_markers.py |
Defines canonical TheRock marker relative paths + helper predicates. |
src/madengine/core/context.py |
Switches host ROCm path resolution to the new resolver and defers container ROCM_PATH finalization. |
src/madengine/execution/container_runner.py |
Finalizes in-container ROCM_PATH during Docker runs and adds a host/container environment summary table. |
src/madengine/core/constants.py |
Reframes get_rocm_path() as legacy resolution and delegates to resolver legacy helper. |
src/madengine/cli/commands/run.py |
Removes --rocm-path CLI option wiring. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH (host and docker_env_vars) shapes in additional context. |
src/madengine/scripts/common/tools/therock_detector.py |
Makes the tool importable when executed as a file; reuses shared marker helpers. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/{rocenv_tool.py,csv_parser.py} |
Passes resolver into CSV parsing and improves tool/path selection for TheRock-style layouts. |
tests/unit/test_rocm_path.py |
Adds/updates unit coverage for resolver behaviors and container override rules. |
tests/unit/test_therock_markers.py |
Adds unit coverage for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Updates patch points and stabilizes integration behavior with Context initialization changes. |
docs/{usage.md,installation.md,configuration.md,cli-reference.md} |
Updates documented precedence and configuration guidance for host vs container ROCm roots. |
README.md |
Updates top-level ROCm path guidance and examples. |
CHANGELOG.md |
Adds an Unreleased entry describing new ROCm resolution behavior. |
src/madengine/database/mongodb.py |
Switches metadata timestamps to timezone-aware UTC. |
tests/fixtures/dummy/models.json |
Adds gfx950 to skip list for a dummy model. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…alongside is_therock_tree) Conflict in container_runner.py: HEAD added is_therock_tree import for TheRock environment table; develop added PERFORMANCE_LOG_PATTERN import from deployment.base. Both imports are used in the file — keep both. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- validators.py: restrict docker_env_vars.MAD_ROCM_PATH to str|None only (was allowing int/float, inconsistent with error message) - rocm_path_resolver.py: always pop MAD_ROCM_PATH from docker_env even when None, preventing the key from leaking into docker run env - rocm_path_resolver.py: add bin/rocm-smi to _CONTAINER_PROBE_SH try_root() predicate to match Python-side looks_like_rocm_root() - container_runner.py: clear ROCM_PATH before finalize_container_rocm_path each run so multi-model runs re-resolve per docker_image instead of reusing image 1's path for all subsequent images - csv_parser.py: shlex.quote() the rocm-smi path before shell interpolation; only increment num_gpu after successful NVIDIA line parse (git-ignored path, committed separately if needed) - README.md: replace curly/smart quotes with ASCII quotes in bash examples - CHANGELOG.md: remove stale --rocm-path alias reference (flag was removed); remove broken docs/adr/ link Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a new ROCm root resolution system to better support TheRock/versioned ROCm layouts and clearly separate host vs container ROCm path overrides, wiring it through context initialization and Docker run-time environment setup.
Changes:
- Introduces
rocm_path_resolver(host auto-detect + container finalization) andtherock_markers(shared marker paths). - Updates runtime flow to resolve host ROCm root in
Contextand set containerROCM_PATHat Docker run-time (OCI env → probe → default), plus prints a host/container environment summary table. - Removes
--rocm-pathand updates validators/tests/docs to useMAD_ROCM_PATH(host) anddocker_env_vars.MAD_ROCM_PATH(container).
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
Implements host ROCm auto-detect and container ROCM_PATH finalization logic. |
src/madengine/utils/therock_markers.py |
Adds shared TheRock marker path helpers and detection. |
src/madengine/core/context.py |
Uses new host ROCm resolver; stops mirroring host ROCM_PATH into container env at init. |
src/madengine/execution/container_runner.py |
Finalizes container ROCM_PATH at run-time and prints host/container env summary. |
src/madengine/core/constants.py |
Reframes get_rocm_path() as legacy (no auto-detect) and delegates to resolver legacy helper. |
src/madengine/cli/commands/run.py |
Removes --rocm-path option and associated wiring. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH types in additional context (host + docker_env_vars). |
src/madengine/orchestration/run_orchestrator.py |
Stops passing legacy rocm_path into Context. |
src/madengine/scripts/common/tools/therock_detector.py |
Reuses shared marker helpers; makes imports work when run as a file. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py |
Passes resolver into CSV generation for version lookup. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py |
Improves tool discovery (PATH-based) and ROCm version collection fallback. |
src/madengine/database/mongodb.py |
Stores upload timestamps using timezone-aware UTC datetimes. |
tests/unit/test_rocm_path.py |
Adds/updates unit tests for resolver behavior and Context integration. |
tests/unit/test_therock_markers.py |
Adds unit tests for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Updates mocks/patch targets for vendor auto-detection and Context init behavior. |
tests/fixtures/dummy/models.json |
Updates dummy fixture skip list to include gfx950. |
docs/configuration.md |
Documents new ROCm path precedence and overrides. |
docs/usage.md |
Documents host/container ROCm path behavior and new env summary table. |
docs/installation.md |
Updates install guidance to MAD_ROCM_PATH / container override behavior. |
docs/cli-reference.md |
Removes --rocm-path references; documents new env/override keys. |
README.md |
Updates top-level ROCm path feature description and examples. |
CHANGELOG.md |
Notes new default host auto-detect + container ROCM_PATH resolution behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…n logic - Remove broken docs/adr/ links from configuration.md, usage.md, rocm_path_resolver.py docstring, and therock_markers.py comment - Remove resolve_container_rocm_path() (thin wrapper with no unique logic) - Remove early apply_container_rocm_path_overrides() call in init_gpu_context: it set ROCM_PATH only for run_container to immediately pop it; container ROCM_PATH finalization now happens exclusively in finalize_container_rocm_path - Fix auto_detect() inner loop: first iteration was a duplicate _looks_like_rocm_root(root) check already done before the loop; walk to root.parent first, then check - Silence grep -oP stderr in container ROCm version command (BusyBox images) - Guard whitespace-only ROCM_PATH in apply_container_rocm_path_overrides - Update tests to match: MAD_ROCM_PATH preserved in docker_env_vars at context init, consumed at run time by finalize_container_rocm_path All 383 unit tests pass. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
) context.py had a conflict in Context.__init__: develop added detect_local_gpu_arch (for GPU arch auto-detection in full-run mode) and rocm_path params; the PR branch dropped both in favour of resolve_host_rocm_path(). Resolution: keep detect_local_gpu_arch (still referenced by init_build_context) and drop rocm_path (replaced by the PR's automatic host ROCm path resolution via resolve_host_rocm_path). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution to better support non-/opt/rocm layouts (notably TheRock / versioned installs) and cleanly separates host ROCm detection from in-container ROCM_PATH selection at Docker run time.
Changes:
- Introduces
madengine.utils.rocm_path_resolver(host auto-detect + container finalization) and shared TheRock marker helpers inmadengine.utils.therock_markers. - Wires host ROCm resolution through
Contextand run execution, and finalizes containerROCM_PATHduring Docker runs (OCI env → in-container probe → default). - Updates tests and documentation to reflect the new override keys (
MAD_ROCM_PATH,docker_env_vars.MAD_ROCM_PATH) and the new run-time environment reporting.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit coverage for TheRock marker-path helpers and detection. |
| tests/unit/test_rocm_path.py | Expands unit coverage for host/container ROCm resolution behavior and precedence. |
| tests/integration/test_gpu_management.py | Updates integration tests to patch the correct vendor auto-detect import and stabilizes Context init in non-AMD scenarios. |
| tests/fixtures/dummy/models.json | Updates dummy model GPU-arch skip list (adds gfx950). |
| src/madengine/utils/therock_markers.py | Defines canonical TheRock marker paths and a minimal is_therock_tree() check. |
| src/madengine/utils/rocm_path_resolver.py | Implements host ROCm auto-detect + container ROCM_PATH finalization (OCI/env/probe/default) and legacy wrappers. |
| src/madengine/scripts/common/tools/therock_detector.py | Makes TheRock detection script importable when run as a file by amending sys.path; uses shared marker helpers with fallback. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py | Passes a resolver into CSV parsing so ROCm version reporting works across layouts. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py | Uses PATH-based tool discovery and resolver-based ROCm version retrieval; adds minor robustness to parsing. |
| src/madengine/orchestration/run_orchestrator.py | Removes explicit --rocm-path plumbing and relies on Context host resolution. |
| src/madengine/execution/container_runner.py | Finalizes in-container ROCM_PATH at run time for AMD and prints a host vs container environment summary. |
| src/madengine/database/mongodb.py | Switches metadata timestamps to timezone-aware UTC values. |
| src/madengine/core/context.py | Replaces legacy ROCm path selection with resolve_host_rocm_path() and stops mirroring host path into container env at init time. |
| src/madengine/core/constants.py | Reframes get_rocm_path() as legacy-only and delegates to resolver’s legacy helper. |
| src/madengine/cli/validators.py | Validates MAD_ROCM_PATH types in additional context (top-level + docker_env_vars). |
| src/madengine/cli/commands/run.py | Removes the --rocm-path CLI option and associated plumbing. |
| docs/usage.md | Documents host vs container ROCm path behavior and the new run-phase environment table. |
| docs/installation.md | Updates installation guidance for host ROCm overrides and clarifies container behavior. |
| docs/configuration.md | Documents new ROCm path precedence and container resolution policy (OCI/env/probe/default). |
| docs/cli-reference.md | Removes --rocm-path from CLI reference and documents new override keys and env vars. |
| README.md | Updates README to reflect host auto-detect and container ROCM_PATH behavior. |
| CHANGELOG.md | Notes the ROCm path resolution behavior change and the new override policy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
apply_container_rocm_path_overrides() was falling back to host_path when docker_env_vars.MAD_ROCM_PATH was None or blank, silently mirroring the host ROCm root into the container — contradicting the documented "no host mirroring by default" policy. Fix: after popping the key (always consumed), return None immediately for None/blank values instead of substituting host_path. Add test_apply_overrides_null_mad_rocm_path_does_not_mirror_host to cover None, "", and whitespace-only values. Resolves Copilot review comment on PR #110. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
The directory was untracked via git rm --cached in c6370bb but never added to .gitignore, leaving it one accidental "git add ." away from being re-committed. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm install-root resolution with separate host vs in-container handling, including TheRock marker support, updated CLI/docs, and test coverage updates across unit + integration tests.
Changes:
- Introduces host ROCm root auto-detection + overrides via top-level
MAD_ROCM_PATH, plus container-timeROCM_PATHfinalization (OCI env → probe → default). - Removes
--rocm-pathfrommadengine run, routing configuration through--additional-context(host +docker_env_vars). - Updates runtime logging/diagnostics (run env table), tooling scripts, validators, and test suites to match the new resolution behavior.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit tests validating TheRock marker-path helpers and detection behavior. |
| tests/unit/test_rocm_path.py | Expands unit coverage for host/container ROCm path resolution, MAD_ROCM_PATH precedence, and container finalization behavior. |
| tests/integration/test_gpu_management.py | Fixes patch target for vendor auto-detect and stabilizes a non-AMD renderD test setup. |
| tests/fixtures/dummy/models.json | Updates dummy fixture to skip gfx950 in skip_gpu_arch. |
| src/madengine/utils/therock_markers.py | Adds shared TheRock marker constants + helper functions for consistent detection. |
| src/madengine/utils/rocm_path_resolver.py | Implements host ROCm auto-detection, container ROCM_PATH finalization, and helper APIs. |
| src/madengine/scripts/common/tools/therock_detector.py | Improves script importability when executed as a file; reuses shared TheRock marker helpers. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py | Passes a path resolver into CSV parsing to support TheRock layouts/version lookup. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py | Uses PATH-based rocm-smi discovery and safer parsing for GPU info + ROCm version lookup. |
| src/madengine/orchestration/run_orchestrator.py | Stops threading a legacy rocm_path arg into Context; relies on Context resolution. |
| src/madengine/execution/container_runner.py | Finalizes container ROCM_PATH at run time and adds a host-vs-container environment summary table. |
| src/madengine/database/mongodb.py | Switches timestamps to timezone-aware UTC to avoid utcnow() deprecation warnings. |
| src/madengine/core/context.py | Resolves host ROCm root via resolve_host_rocm_path; stops mirroring host ROCM_PATH into container env at init. |
| src/madengine/core/constants.py | Re-scopes get_rocm_path() as legacy resolution (no auto-detect) delegating to the new resolver helper. |
| src/madengine/cli/validators.py | Validates MAD_ROCM_PATH types in additional context (host + docker_env_vars). |
| src/madengine/cli/commands/run.py | Removes --rocm-path flag from the run command. |
| docs/usage.md | Documents host vs container ROCm path behavior and the run-phase environment table output. |
| docs/installation.md | Updates guidance to use MAD_ROCM_PATH (host) and docker_env_vars.MAD_ROCM_PATH (container). |
| docs/configuration.md | Documents new resolution precedence and container-time ROCM_PATH finalization behavior. |
| docs/cli-reference.md | Removes --rocm-path from reference and updates examples for MAD_ROCM_PATH usage. |
| README.md | Updates top-level ROCm path messaging and examples to match new behavior. |
| CHANGELOG.md | Records new auto-detect behavior, container resolution, CLI flag removal, and related fixes. |
| .gitignore | Ignores run_directory/ artifacts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Adds a new ROCm root resolution pipeline to clearly separate host ROCm root detection from in-container ROCM_PATH resolution, and removes the ambiguous --rocm-path CLI flag in favor of MAD_ROCM_PATH via --additional-context.
Changes:
- Introduces
rocm_path_resolver+therock_markersutilities and threads host ROCm root resolution throughContext. - Finalizes container
ROCM_PATHduring Docker run (OCI env → in-container probe → default), and updates runtime logging to show host vs container environment details. - Updates CLI/docs/changelog, adjusts TheRock detection tooling, and extends unit/integration tests accordingly.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit tests for TheRock marker path helpers. |
| tests/unit/test_rocm_path.py | Updates/extends unit tests for MAD_ROCM_PATH + auto-detect + container finalization behavior. |
| tests/integration/test_gpu_management.py | Fixes vendor auto-detect patch target; stabilizes Context integration assertions via mocking. |
| tests/fixtures/dummy/models.json | Adds gfx950 to the skip list for dummy fixture coverage. |
| src/madengine/utils/therock_markers.py | Adds shared TheRock marker path helpers (share/therock/...). |
| src/madengine/utils/rocm_path_resolver.py | Implements host auto-detect + container finalization logic (OCI env + probe) and override handling. |
| src/madengine/scripts/common/tools/therock_detector.py | Makes marker imports work when executed as a standalone file by adjusting sys.path; reuses marker helpers. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py | Passes path_resolver into CSVParser so ROCm version works for TheRock and traditional layouts. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py | Resolves rocm-smi via PATH first; uses resolver for version; adds bounds-checking in NVIDIA parsing. |
| src/madengine/orchestration/run_orchestrator.py | Removes --rocm-path wiring; relies on Context’s host ROCm resolution. |
| src/madengine/execution/container_runner.py | Finalizes in-container ROCM_PATH at runtime; switches AMD SMI checks to PATH-based; prints host vs container env table. |
| src/madengine/database/mongodb.py | Replaces datetime.utcnow() with datetime.now(timezone.utc). |
| src/madengine/core/context.py | Resolves host ROCm root via resolve_host_rocm_path; stops setting container ROCM_PATH during context init. |
| src/madengine/core/constants.py | Keeps get_rocm_path() as legacy (no auto-scan), delegating to resolver’s legacy helper. |
| src/madengine/cli/validators.py | Validates MAD_ROCM_PATH types in additional context (top-level and docker_env_vars). |
| src/madengine/cli/commands/run.py | Removes the --rocm-path option and associated plumbing. |
| docs/usage.md | Documents host vs container ROCm behavior and the new run-phase environment table. |
| docs/installation.md | Updates non-default ROCm guidance to use MAD_ROCM_PATH semantics and clarifies container behavior. |
| docs/configuration.md | Documents host auto-detect, override keys, and container precedence. |
| docs/cli-reference.md | Removes --rocm-path references; adds MAD_ROCM_PATH examples. |
| README.md | Updates top-level ROCm path guidance to match new host/container split. |
| CHANGELOG.md | Documents new ROCm path resolution behavior and removal of --rocm-path. |
| .gitignore | Ignores run_directory/ output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Clear any ROCM_PATH left from a previous model run so finalize always | ||
| # re-resolves per docker_image (OCI config → probe → default). | ||
| # ROCM_PATH is program-managed; users override via MAD_ROCM_PATH in | ||
| # additional_context.docker_env_vars, which was merged just above and | ||
| # will be consumed by finalize_container_rocm_path correctly. | ||
| self.context.ctx["docker_env_vars"].pop("ROCM_PATH", None) | ||
|
|
| host_rocm_root = getattr(context, "_rocm_path", None) or "unknown" | ||
| host_install_type = ( | ||
| "therock" | ||
| if is_therock_tree(Path(host_rocm_root)) | ||
| else "apt install" | ||
| ) |
| if existing not in (None, ""): | ||
| croot = normalize_rocm_path(str(existing).strip()) |
| - **Additional context (container):** `"docker_env_vars": { "MAD_ROCM_PATH": "/path/inside/image" }` — sets the in-container `ROCM_PATH` for Docker runs. If omitted, at `run` time madengine uses the image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) if present, then an in-container probe, then defaults to `/opt/rocm`. The host-resolved path is **not** mirrored into the container. | ||
|
|
||
| These two keys are independent, allowing host and container to use different ROCm installations without confusion. | ||
|
|
||
| Precedence (host): top-level `MAD_ROCM_PATH` → auto-detect (unless disabled) → `ROCM_PATH` → `/opt/rocm`. | ||
|
|
||
| Precedence (container, **local Docker `run`**, **AMD**): `docker_env_vars.MAD_ROCM_PATH` (maps to `ROCM_PATH` for the workload) or explicit `ROCM_PATH` in `docker_env_vars` → image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) → in-image probe → default `/opt/rocm` with a warning. Implemented in `ContainerRunner.run_container` after the run image is resolved. |
Summary
Adds automatic ROCm install root resolution for the run phase, with independent host and container path handling.
What changed
madengine.utils.rocm_path_resolverandmadengine.utils.therock_markers(shared TheRockshare/therockfile markers).MAD_ROCM_PATHin additional context → auto-detect (traditional/opt/rocm, versioned/opt/rocm-*, TheRockrocm-sdkmarkers,rocminfo/amd-smi/rocm-smionPATH) →ROCM_PATHenv var →/opt/rocm. SetMAD_AUTO_ROCM_PATH=0to skip scanning.docker_env_vars.MAD_ROCM_PATHsets in-containerROCM_PATH(key consumed, not forwarded as-is); else resolved from OCI image config (docker image inspect) → in-image shell probe (docker run --rm) → default/opt/rocmwith warning. The host-resolved path is not mirrored into the container by default.--rocm-pathCLI flag: was an alias forMAD_ROCM_PATHbut implied it covered both host and container paths. Use--additional-contextinstead:{"MAD_ROCM_PATH": "/path/to/host/rocm"}{"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}therock_detectorscript: prepends the package parent tosys.pathso imports work when run as a file.test_rocm_path,test_therock_markers) and integration GPU test adjustments.How to test
pytest tests/unit/test_rocm_path.py tests/unit/test_therock_markers.py(or fulltests/unit/).madengine run --help— confirm--rocm-pathno longer appears;MAD_ROCM_PATHis set via--additional-context.madengine runwithoutMAD_ROCM_PATHand confirm a sensibleROCM_PATHin the generated context.Checklist
--rocm-pathCLI flag removed; interface isMAD_ROCM_PATHin--additional-context.MAD_ROCM_PATH: null/blank indocker_env_varstreated as "unset" — host path not mirrored.MAD_AUTO_ROCM_PATH=0beyond documented legacy path.